59 research outputs found
Statistical Methods for Integrative Analysis, Subgroup Identification, and Variable Selection Using Cancer Genomic Data
In recent years, comprehensive cancer genomics platforms such as The Cancer Genome Atlas (TCGA) have provided access to an enormous amount of high-throughput genomic data for each patient, including gene expression, DNA copy number alteration, DNA methylation, and somatic mutation. Most existing analysis approaches focus only on gene-level analysis and suffer from limited interpretability and low reproducibility of findings. Additionally, with the increasing availability of modern compositional data, including immune cellular fraction data and high-dimensional zero-inflated microbiome data, variable selection techniques for compositional data have become of great interest because they allow inference of key immune cell types (immunology data) and key microbial species (microbiome data) associated with the development and progression of various diseases. In the first dissertation aim, we address these challenges by developing a Bayesian sparse latent factor model for pathway-guided integrative genomic data analysis. Specifically, we constructed a unified framework to simultaneously identify cancer patient subgroups (clustering) and key molecular markers (variable selection) based on the joint analysis of continuous, binary, and count data. In addition, we applied Pólya-Gamma mixtures of normals to the binary and count data to enable exact and fully automatic posterior sampling. Moreover, pathway information was used to improve accuracy and robustness in the identification of cancer patient subgroups and key molecular features. In the second dissertation aim, we developed the R package InGRiD, comprehensive software for pathway-guided integrative genomic data analysis; the statistical model developed in Aim 1 is implemented and provided as part of this software. The third dissertation aim addresses variable selection in compositional data analysis, with applications to immunology and microbiome data.
Specifically, we identified key immune cell types by applying a stepwise pairwise log-ratio procedure to the immune cellular fraction data, and selected key species in the microbiome data by using a zero-inflated Wilcoxon rank-sum test. These approaches account for key features specific to these data types, such as compositionality (i.e., the sum-to-one constraint), zero inflation, and high dimensionality. The proposed methods were developed and evaluated on: 1) large-scale, high-dimensional, multi-modal datasets from the TCGA database, including gene expression, DNA copy number alteration, and somatic mutation data (Aim 1); 2) cellular fraction data derived from the Colorectal Adenocarcinoma TCGA Pan-Cancer study (Aim 3); and 3) high-dimensional zero-inflated microbiome data from studies of colorectal cancer (Aim 3).
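The pairwise log-ratio idea can be sketched in a few lines. This is a hypothetical illustration of the transform itself (the function name `pairwise_log_ratios` is ours), not the dissertation's stepwise selection procedure:

```python
import itertools
import math

def pairwise_log_ratios(composition):
    """All pairwise log-ratios log(x_i / x_j), i < j, for one compositional
    sample (strictly positive parts that sum to one)."""
    return {(i, j): math.log(composition[i] / composition[j])
            for i, j in itertools.combinations(range(len(composition)), 2)}

# Example: a 3-part composition, e.g. fractions of three immune cell types.
x = [0.5, 0.3, 0.2]
ratios = pairwise_log_ratios(x)  # features keyed by component pair
```

Because log-ratios are invariant to rescaling of the whole sample, they respect the sum-to-one constraint that makes raw proportions awkward inputs for standard variable selection.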
Colloidal III–V Nitride Quantum Dots
Colloidal quantum dots (QDs) have attracted intense attention in both fundamental studies and practical applications. To date, size-, morphology-, and composition-controlled syntheses have been successfully achieved in II–VI semiconductor nanocrystals. Recently, III-nitride semiconductor quantum dots have begun to draw significant interest due to their promising applications in solid-state lighting, lasing technologies, and optoelectronic devices. The quality of nitride nanocrystals is, however, dramatically lower than that of II–VI semiconductor nanocrystals. In this review, recent developments in the synthesis techniques and properties of colloidal III–V nitride quantum dots, as well as their applications, are introduced.
EventEA: Benchmarking Entity Alignment for Event-centric Knowledge Graphs
Entity alignment aims to find entities in different knowledge graphs
(KGs) that refer to the same real-world object. Embedding-based entity
alignment techniques have been drawing a lot of attention recently because they
can help solve the issue of symbolic heterogeneity in different KGs. However,
in this paper, we show that the progress made in the past was due to biased and
unchallenging evaluation. We highlight two major flaws in existing datasets
that favor embedding-based entity alignment techniques, i.e., the isomorphic
graph structures in relation triples and the weak heterogeneity in attribute
triples. Towards a critical evaluation of embedding-based entity alignment
methods, we construct a new dataset with heterogeneous relations and attributes
based on event-centric KGs. We conduct extensive experiments to evaluate
existing popular methods, and find that they fail to achieve promising
performance. As a new approach to this difficult problem, we propose a
time-aware literal encoder for entity alignment. The dataset and source code
are publicly available to foster future research. Our work calls for more
effective and practical embedding-based solutions to entity alignment.
Comment: submitted to ISWC 202
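The embedding-based setting the paper critiques can be illustrated with a minimal nearest-neighbor matcher over pre-trained entity embeddings. This is a toy sketch under our own names (`cosine`, `align_entities`), not EventEA's time-aware literal encoder:

```python
import math

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def align_entities(emb_kg1, emb_kg2):
    """For each KG1 entity, pick the KG2 entity with the most similar
    embedding. Matchers of this kind do well on near-isomorphic KGs and
    degrade under the heterogeneity the EventEA benchmark introduces."""
    return {e1: max(emb_kg2, key=lambda e2: cosine(v1, emb_kg2[e2]))
            for e1, v1 in emb_kg1.items()}
```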
Weighted AdaGrad with Unified Momentum
Integrating adaptive learning rate and momentum techniques into SGD leads to
a large class of efficiently accelerated adaptive stochastic algorithms, such
as Nadam and AccAdaGrad. In spite of their effectiveness in practice, there
is still a large gap in their convergence theory, especially in the
difficult non-convex stochastic setting. To fill this gap, we propose
weighted AdaGrad with unified momentum, dubbed AdaUSM, which has
the main characteristics that (1) it incorporates a unified momentum scheme
which covers both the heavy ball momentum and the Nesterov accelerated gradient
momentum; (2) it adopts a novel weighted adaptive learning rate that can unify
the learning rates of AdaGrad, AccAdaGrad, Adam, and RMSProp. Moreover, when we
take polynomially growing weights in AdaUSM, we obtain its
convergence rate in the non-convex stochastic
setting. We also show that the adaptive learning rates of Adam and RMSProp
correspond to taking exponentially growing weights in AdaUSM, thereby
providing a new perspective for understanding Adam and RMSProp. Lastly,
comparative experiments of AdaUSM against SGD with momentum, AdaGrad, AdaEMA,
Adam, and AMSGrad on various deep learning models and datasets are also
provided.
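A simplified, coordinate-wise reading of the update: the squared gradients are accumulated with polynomially growing weights w_t = t (plain AdaGrad corresponds to w_t = 1), and a single momentum buffer serves both the heavy-ball and a Nesterov-style variant. This is a sketch under our own simplifications, not the paper's exact algorithm:

```python
import math

def adausm_step(param, grad, state, lr=0.1, beta=0.9, nesterov=False, t=1):
    """One coordinate-wise step of a simplified weighted-AdaGrad update with
    a unified momentum term (hypothetical simplification of AdaUSM)."""
    w = float(t)                               # polynomially growing weight w_t = t
    state['v'] = state.get('v', 0.0) + w * grad * grad
    state['wsum'] = state.get('wsum', 0.0) + w
    denom = math.sqrt(state['v'] / state['wsum']) + 1e-8
    m = beta * state.get('m', 0.0) + grad      # momentum buffer
    state['m'] = m
    # Heavy-ball uses the buffer directly; the Nesterov-style variant
    # applies a lookahead correction.
    direction = grad + beta * m if nesterov else m
    return param - lr * direction / denom
```

Taking exponentially growing weights w_t in the same scheme recovers Adam/RMSProp-style learning rates, which is the paper's new perspective on those methods.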
Deep Active Alignment of Knowledge Graph Entities and Schemata
Knowledge graphs (KGs) store rich facts about the real world. In this paper,
we study KG alignment, which aims to find alignment between not only entities
but also relations and classes in different KGs. Alignment at the entity level
can cross-fertilize alignment at the schema level. We propose a new KG
alignment approach, called DAAKG, based on deep learning and active learning.
With deep learning, it learns the embeddings of entities, relations and
classes, and jointly aligns them in a semi-supervised manner. With active
learning, it estimates how likely an entity, relation or class pair can be
inferred, and selects the best batch for human labeling. We design two
approximation algorithms for efficiently solving batch selection. Our
experiments on benchmark datasets show the superior accuracy and generalization
of DAAKG and validate the effectiveness of all its modules.
Comment: Accepted in the ACM SIGMOD/PODS International Conference on Management of Data (SIGMOD 2023)
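The batch-selection step can be approximated by a standard active-learning heuristic: rank candidate pairs by the entropy of the model's predicted match probability and send the most uncertain ones to annotators. A hedged sketch (the name `select_batch` is ours); DAAKG's actual selection also exploits which pairs can be inferred from others, which this omits:

```python
import math

def select_batch(pair_probs, k):
    """Return the k candidate pairs with the highest binary entropy of
    their predicted match probability (most uncertain first)."""
    def entropy(p):
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log(p) + (1 - p) * math.log(1 - p))
    ranked = sorted(pair_probs, key=lambda pair: entropy(pair_probs[pair]),
                    reverse=True)
    return ranked[:k]
```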
Lifelong Embedding Learning and Transfer for Growing Knowledge Graphs
Existing knowledge graph (KG) embedding models have primarily focused on
static KGs. However, real-world KGs do not remain static, but rather evolve and
grow in tandem with the development of KG applications. Consequently, new facts
and previously unseen entities and relations continually emerge, necessitating
an embedding model that can quickly learn and transfer new knowledge through
growth. Motivated by this, we delve into an expanding field of KG embedding in
this paper, i.e., lifelong KG embedding. We consider knowledge transfer and
retention of the learning on growing snapshots of a KG without having to learn
embeddings from scratch. The proposed model includes a masked KG autoencoder
for embedding learning and update, with an embedding transfer strategy to
inject the learned knowledge into the new entity and relation embeddings, and
an embedding regularization method to avoid catastrophic forgetting. To
investigate the impacts of different aspects of KG growth, we construct four
datasets to evaluate the performance of lifelong KG embedding. Experimental
results show that the proposed model outperforms the state-of-the-art inductive
and lifelong embedding baselines.
Comment: Accepted in the 37th AAAI Conference on Artificial Intelligence (AAAI 2023)
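The embedding-regularization ingredient can be sketched as an L2 anchor that pulls retained entities' embeddings toward their values from the previous snapshot while fine-tuning on new facts. This is a generic sketch of a common anti-forgetting penalty, not the paper's exact regularizer:

```python
def regularized_update(emb, grad, old_emb, lr=0.01, lam=0.1):
    """One gradient step on the task loss plus the penalty
    lam * ||e - e_old||^2, which anchors previously learned embeddings
    to their old-snapshot values to curb catastrophic forgetting."""
    return [e - lr * (g + 2 * lam * (e - eo))
            for e, g, eo in zip(emb, grad, old_emb)]
```

With the task gradient set to zero, the update moves each coordinate strictly toward its old value, which is the retention behavior the penalty is meant to provide.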